This is me working through relevant exercises in R for Data Science by Garrett Grolemund and Hadley Wickham.
filter()
Load packages, data.
library(nycflights13)
library(tidyverse)
Select flights that happened on January 1, 2013.
jan1 <- filter(flights, month == 1, day == 1)
jan1
# alternatively, enclose in parens
(jan1 <- filter(flights, month == 1, day == 1))
When dealing with floating point numbers (decimals), use near() rather than == for equality comparisons.
sqrt(2)^2 == 2
## [1] FALSE
near(sqrt(2)^2, 2)
## [1] TRUE
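Under the hood, near() is essentially an absolute-difference comparison with a small tolerance (dplyr’s default is .Machine$double.eps^0.5). A base-R sketch (near_sketch is a hypothetical name, not a dplyr function):

```r
# a base-R sketch of what near() does; tolerance mirrors dplyr's default
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}
near_sketch(sqrt(2)^2, 2)  # TRUE
```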
Instead of writing long disjunctives, use %in%.
nov_dec <- filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
De Morgan’s Law:
!(x & y) == (!x | !y)
!(x | y) == (!x & !y)
If you want to include rows that are NA, you have to do so explicitly (for filter()).
df <- tibble(x = c(1, NA, 3))
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
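The De Morgan identities above can be checked directly on logical vectors covering all four TRUE/FALSE combinations:

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
all(!(x & y) == (!x | !y))  # TRUE
all(!(x | y) == (!x & !y))  # TRUE
```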
delay <- filter(flights, arr_delay >= 120)
summary(delay$arr_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120.0 137.0 163.0 184.2 207.0 1272.0
houston <- filter(flights, dest %in% c("IAH", "HOU"))
table(houston$dest)
##
## HOU IAH
## 2115 7198
airlines_filt <- filter(flights, carrier %in% c("UA", "AA", "DL"))
table(airlines_filt$carrier)
##
## AA DL UA
## 32729 48110 58665
summer <- filter(flights, month %in% c(7,8,9))
table(summer$month)
##
## 7 8 9
## 29425 29327 27574
arrdelay <- filter(flights, dep_delay <= 0 & arr_delay > 120)
arrdelay
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 27 1419 1420 -1 1754
## 2 2013 10 7 1350 1350 0 1736
## 3 2013 10 7 1357 1359 -2 1858
## 4 2013 10 16 657 700 -3 1258
## 5 2013 11 1 658 700 -2 1329
## 6 2013 3 18 1844 1847 -3 39
## 7 2013 4 17 1635 1640 -5 2049
## 8 2013 4 18 558 600 -2 1149
## 9 2013 4 18 655 700 -5 1213
## 10 2013 5 22 1827 1830 -3 2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
makeup <- filter(flights, dep_delay >= 60 & (arr_delay < dep_delay - 30))
morning <- filter(flights, dep_time >= 0 & dep_time <= 600)
summary(morning$dep_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 540.0 554.0 488.7 557.0 600.0
between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between() allows you to select a range for a value (e.g., between 7 and 9). It is inclusive on both sides.
summer <- filter(flights, between(month, 7, 9))
morning <- filter(flights, between(dep_time, 0, 600))
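between(x, left, right) is just shorthand for x >= left & x <= right, which is why both endpoints are included. A plain-vector sketch (between_sketch is a hypothetical name):

```r
# mirrors between(): both endpoints included
between_sketch <- function(x, left, right) x >= left & x <= right
between_sketch(c(6, 7, 8, 9, 10), 7, 9)  # FALSE TRUE TRUE TRUE FALSE
```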
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
missing_dep <- filter(flights, is.na(dep_time))
missing_dep
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
These flights (n = 8255) are missing actual departure and arrival times, as well as air time and sometimes tail number. These are probably flights that were cancelled.
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
These expressions don’t depend on the value that NA stands in for: anything to the zeroth power is 1, anything OR TRUE is TRUE, and anything AND FALSE is FALSE. The general rule is that when the result would be the same for every possible value of the missing entry, R can return it. NA * 0 is the counterexample: if NA stood for Inf or -Inf, the product would be NaN rather than 0, so R must return NA. The practical lesson is to check how a given function treats NA before using it on missing data.
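Each of these can be checked at the console:

```r
NA ^ 0      # 1: x^0 is 1 for every x, so the answer doesn't depend on NA
NA | TRUE   # TRUE: x | TRUE is TRUE for every x
FALSE & NA  # FALSE: FALSE & x is FALSE for every x
NA * 0      # NA: Inf * 0 would be NaN, so the answer does depend on NA
```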
arrange()
Adding multiple columns is like saying THEN: arrange by year THEN month THEN day.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
desc() does descending order.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values are sorted to the end (matching how the RStudio viewer displays them).
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
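Base R behaves similarly: sort() drops missing values by default, but na.last = TRUE keeps them at the end, matching arrange()’s ordering.

```r
x <- c(5, 2, NA)
sort(x)                  # 2 5 (NA dropped)
sort(x, na.last = TRUE)  # 2 5 NA, like arrange()
```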
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
df <- tibble(x = c(5, 2, NA))
arrange(df, desc(is.na(x)))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 NA
## 2 5
## 3 2
arrange(flights, desc(dep_delay), dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, air_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 16 1355 1315 40 1442
## 2 2013 4 13 537 527 10 622
## 3 2013 12 6 922 851 31 1021
## 4 2013 2 3 2153 2129 24 2247
## 5 2013 2 5 1303 1315 -12 1342
## 6 2013 2 12 2123 2130 -7 2211
## 7 2013 3 2 1450 1500 -10 1547
## 8 2013 3 8 2026 1935 51 2131
## 9 2013 3 18 1456 1329 87 1533
## 10 2013 3 19 2226 2145 41 2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Woooo Newark to Bradley – 20 minutes!
arrange(flights, distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2 2222
## 3 2013 1 4 1240 1200 40 1333
## 4 2013 1 4 1829 1615 134 1937
## 5 2013 1 4 2128 2129 -1 2218
## 6 2013 1 5 1155 1200 -5 1241
## 7 2013 1 6 2125 2129 -4 2224
## 8 2013 1 7 2124 2129 -5 2212
## 9 2013 1 8 2127 2130 -3 2304
## 10 2013 1 9 2126 2129 -3 2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Lol, there’s a cancelled flight between Newark and LaGuardia. Otherwise it’s Newark to Philly as the shortest.
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 2 909 900 9 1525
## 3 2013 1 3 914 900 14 1504
## 4 2013 1 4 900 900 0 1516
## 5 2013 1 5 858 900 -2 1519
## 6 2013 1 6 1019 900 79 1558
## 7 2013 1 7 1042 900 102 1620
## 8 2013 1 8 901 900 1 1504
## 9 2013 1 9 641 900 1301 1242
## 10 2013 1 10 859 900 -1 1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Looks like it’s a flight between JFK and Honolulu, which is almost 5,000 miles. It’s only about 9 hours!
select()
Select lets you pick out columns.
# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Helper functions:
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.
num_range("x", 1:3): matches x1, x2 and x3.
everything(): selects columns you haven’t mentioned yet.
# moves time_hour and air_time to the front of the tibble
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
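The matches() helper is the most flexible of these because it takes a full regular expression; conceptually it just runs grepl() over the column names. A few hand-picked names show what "(.)\\1" (a repeated character) would select:

```r
nms <- c("arr_time", "carrier", "tailnum", "dep_delay")
grepl("(.)\\1", nms)  # TRUE TRUE FALSE FALSE ("rr" repeats; the others don't)
```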
rename() is useful to rename columns.
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, c(4,6,7,9))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, year, year, day, year)
## # A tibble: 336,776 x 2
## year day
## <int> <int>
## 1 2013 1
## 2 2013 1
## 3 2013 1
## 4 2013 1
## 5 2013 1
## 6 2013 1
## 7 2013 1
## 8 2013 1
## 9 2013 1
## 10 2013 1
## # ... with 336,766 more rows
It only returns the column once.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
It’s kind of like %in% for select() – if a column matches one of the names in the vector, it is returned.
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>
It looks like case doesn’t matter for select() by default. You can change that with ignore.case = FALSE.
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 x 0
mutate()
Mutate makes new variables and adds them to the end of the tibble. You can use these new variables to create even more variables all in the same call.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 11
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 370.
## 2 2013 1 1 4 20 1416 227 -16 374.
## 3 2013 1 1 2 33 1089 160 -31 408.
## 4 2013 1 1 -1 -18 1576 183 17 517.
## 5 2013 1 1 -6 -25 762 116 19 394.
## 6 2013 1 1 -4 12 719 150 -16 288.
## 7 2013 1 1 -5 19 1065 158 -24 404.
## 8 2013 1 1 -3 -14 229 53 11 259.
## 9 2013 1 1 -3 -8 944 140 5 405.
## 10 2013 1 1 -2 8 733 138 -10 319.
## # ... with 336,766 more rows, and 2 more variables: hours <dbl>,
## # gain_per_hour <dbl>
If you only want to keep the new variables, use transmute().
transmute(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 -9 370. 3.78 -2.38
## 2 -16 374. 3.78 -4.23
## 3 -31 408. 2.67 -11.6
## 4 17 517. 3.05 5.57
## 5 19 394. 1.93 9.83
## 6 -16 288. 2.5 -6.4
## 7 -24 404. 2.63 -9.11
## 8 11 259. 0.883 12.5
## 9 5 405. 2.33 2.14
## 10 -10 319. 2.3 -4.35
## # ... with 336,766 more rows
You can use any function with mutate() as long as it is vectorized: it must take a vector of values as input, return a vector with the same number of values as output.
Useful functions:
%/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
min_rank(). If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights_sml <- select(flights, contains("dep"))
mutate(flights_sml,
dep_min_since_midn = (dep_time %/% 100) * 60 + (dep_time %% 100),
sched_dep_min_since_midn = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))
## # A tibble: 336,776 x 5
## dep_time sched_dep_time dep_delay dep_min_since_midn sched_dep_min_sin…
## <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 2 317 315
## 2 533 529 4 333 329
## 3 542 540 2 342 340
## 4 544 545 -1 344 345
## 5 554 600 -6 354 360
## 6 554 558 -4 354 358
## 7 555 600 -5 355 360
## 8 557 600 -3 357 360
## 9 557 600 -3 357 360
## 10 558 600 -2 358 360
## # ... with 336,766 more rows
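The arithmetic is easy to sanity-check on a single clock value: 517 means 5:17 a.m., which is 5 * 60 + 17 = 317 minutes after midnight.

```r
dep <- 517
dep %/% 100                      # 5, the hour
dep %% 100                       # 17, the minute
(dep %/% 100) * 60 + dep %% 100  # 317
```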
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
mutate(select(flights, arr_time, dep_time, air_time),
tot_time = arr_time - dep_time)
## # A tibble: 336,776 x 4
## arr_time dep_time air_time tot_time
## <int> <int> <dbl> <int>
## 1 830 517 227 313
## 2 850 533 227 317
## 3 923 542 160 381
## 4 1004 544 183 460
## 5 812 554 116 258
## 6 740 554 150 186
## 7 913 555 158 358
## 8 709 557 53 152
## 9 838 557 140 281
## 10 753 558 138 195
## # ... with 336,766 more rows
I think I have to change arr_time and dep_time to continuous values.
mutate(select(flights, dep_time, arr_time, air_time),
dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
arr_cont = (arr_time %/% 100) * 60 + (arr_time %% 100),
tot_time = arr_cont - dep_cont)
## # A tibble: 336,776 x 6
## dep_time arr_time air_time dep_cont arr_cont tot_time
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 517 830 227 317 510 193
## 2 533 850 227 333 530 197
## 3 542 923 160 342 563 221
## 4 544 1004 183 344 604 260
## 5 554 812 116 354 492 138
## 6 554 740 150 354 460 106
## 7 555 913 158 355 553 198
## 8 557 709 53 357 429 72
## 9 557 838 140 357 518 161
## 10 558 753 138 358 473 115
## # ... with 336,766 more rows
This does not deal with flights that take off before midnight one night and land the next day. I would probably have to filter for those rows and do something different (maybe using lubridate?).
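One possible fix is simpler than filtering: take the difference modulo the number of minutes in a day, so an overnight flight comes out positive instead of negative. This is a sketch, and it assumes no flight lasts 24 hours or more.

```r
to_min <- function(t) (t %/% 100) * 60 + t %% 100  # clock time -> minutes
# depart 23:30, arrive 00:45 the next morning: 75 minutes gate to gate
(to_min(45) - to_min(2330)) %% (24 * 60)  # 75
```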
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
dep <- select(flights, dep_time, sched_dep_time, dep_delay)
mutate(dep,
dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
sched_dep_cont = (sched_dep_time %/% 100)* 60 + (sched_dep_time %% 100),
calc_dep_delay = dep_cont - sched_dep_cont)
## # A tibble: 336,776 x 6
## dep_time sched_dep_time dep_delay dep_cont sched_dep_cont
## <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 2 317 315
## 2 533 529 4 333 329
## 3 542 540 2 342 340
## 4 544 545 -1 344 345
## 5 554 600 -6 354 360
## 6 554 558 -4 354 358
## 7 555 600 -5 355 360
## 8 557 600 -3 357 360
## 9 557 600 -3 357 360
## 10 558 600 -2 358 360
## # ... with 336,766 more rows, and 1 more variable: calc_dep_delay <dbl>
dep_delay is the difference between the continuous versions of dep_time and sched_dep_time.
delay <- select(flights, arr_delay)
arrange(mutate(delay,
delay_rank = min_rank(arr_delay)), desc(delay_rank))
## # A tibble: 336,776 x 2
## arr_delay delay_rank
## <dbl> <int>
## 1 1272 327346
## 2 1127 327345
## 3 1109 327344
## 4 1007 327343
## 5 989 327342
## 6 931 327341
## 7 915 327340
## 8 895 327339
## 9 878 327338
## 10 875 327337
## # ... with 336,766 more rows
What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
It returns the sums along with the warning “longer object length is not a multiple of shorter object length.” R recycles the shorter vector (1:3) against the longer one (1:10) and adds element by element; the warning appears because 10 is not a multiple of 3, which usually signals a mistake.
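R is recycling the shorter vector here; when the longer length is an exact multiple of the shorter one, the same recycling happens silently:

```r
# 1:2 is recycled five times against 1:10, with no warning
1:2 + 1:10  # 2 4 4 6 6 8 8 10 10 12
```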
R has the following trigonometric functions:
cos(x)
sin(x)
tan(x)
acos(x)
asin(x)
atan(x)
atan2(y, x)
# cospi(x), sinpi(x), and tanpi(x), compute cos(pi*x), sin(pi*x), and tan(pi*x)
cospi(x)
sinpi(x)
tanpi(x)
summarise()
summarise() collapses a data frame to a single row. It’s especially useful for grouped summaries.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
To make this code easier, we can use the pipe %>%.
# by destination, count the number of flights, mean distance, and mean arrival delay.
## only use locations with more than 20 flights that are not HNL.
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
For most aggregation functions, if there is any missing data the result will be NA. To avoid this, use na.rm = TRUE.
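A minimal illustration of why na.rm = TRUE matters:

```r
x <- c(1, NA, 3)
mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 2: computed over the non-missing values
```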
Also, it’s useful to always include a count n() or count of non-missing values sum(!is.na(x)) when looking at aggregated data, to make sure you’re not drawing conclusions on a small amount of data.
Useful summary functions:
mean(x), median(x)
sd(x), IQR(x), mad(x) (the last two are robust to outliers)
min(x), quantile(x, 0.25), max(x)
first(x), nth(x, 2), last(x)
n(), sum(!is.na(x)), n_distinct(x) (counts of rows, non-missing values, and unique values)
sum(x > 10), mean(y == 0) (for a logical vector x, sum(x) gives the number of TRUEs and mean(x) gives the proportion)
There’s also a special function for counts, count(), that takes a weighting variable if needed.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
# sums total number of miles a given plane flew
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
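The logical-counting idiom from the summary-function list (sum(x > 10), mean(x > 10)) works because TRUE and FALSE coerce to 1 and 0:

```r
x <- c(4, 8, 15, 16, 23, 42)
sum(x > 10)   # 4 values are above 10
mean(x > 10)  # 4/6, the proportion above 10
```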
ungroup() reverses group_by().
For flights that are early half the time and late half the time (the 50/50 flights), just taking the mean of the delay gives 0, which implies that the flight is basically always on time – but that’s not true. Instead we probably want some measure of the variability in delay time as well as the mean delay time.
not_cancelled %>%
group_by(carrier, flight) %>%
summarise(count = n(),
m_delay = mean(arr_delay, na.rm = TRUE),
sd_delay = sd(arr_delay, na.rm = TRUE)) %>%
filter(count > 25) %>%
ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
geom_point(alpha = 0.5) + theme_bw()
Here we can see that the most consistent flights are also the ones that tend to arrive early. We can do the same thing with dep_delay.
not_cancelled %>%
group_by(carrier, flight) %>%
summarise(count = n(),
m_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE)) %>%
filter(count > 25) %>%
ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
geom_point(alpha = 0.5) + theme_bw()
Come up with another approach that gives the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
# number of flights by destination
not_cancelled %>%
group_by(dest) %>%
summarise(count = n())
## # A tibble: 104 x 2
## dest count
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
# total number of miles by plane
not_cancelled %>%
group_by(tailnum) %>%
summarise(tot_mi = sum(distance, na.rm = TRUE))
## # A tibble: 4,037 x 2
## tailnum tot_mi
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Our definition, (is.na(dep_delay) | is.na(arr_delay)), is slightly suboptimal. Why? Which is the most important column?
I think in this case the arr_delay column is more important, because flights can be delayed multiple times before they are actually cancelled. dep_time might just be the latest time the flight was expected to take off before it was cancelled. If arr_delay or arr_time is missing, this implies that the flight never landed at its destination. This could mean one of a few things: the flight was cancelled outright, or it departed but was diverted somewhere else.
In the case of the second option, we don’t have information about where a diverted flight actually did land, so for our purposes those flights are as good as cancelled. So, it seems like the best option here is to only use flights that have complete data for arr_delay.
flights %>%
group_by(month, day) %>%
summarise(count = n(),
cancelled = sum(is.na(arr_delay)),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = cancelled, y = mean_delay)) +
geom_point(alpha = 0.5) + theme_bw() + geom_smooth(se = FALSE)
Intuitively, it would make sense that days with worse weather would have both more cancelled flights and more delayed flights. It seems like it’s a U-shaped curve – likely, once the weather gets so bad, all the flights get cancelled rather than some delayed and some cancelled.
(Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()).)
flights %>%
group_by(carrier, dest) %>%
summarise(count = n(),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
## # A tibble: 314 x 4
## # Groups: carrier [16]
## carrier dest count mean_delay
## <chr> <chr> <int> <dbl>
## 1 UA STL 2 110
## 2 OO ORD 1 107
## 3 OO DTW 2 68.5
## 4 UA RDU 1 56
## 5 EV CAE 113 42.8
## 6 EV TYS 323 41.2
## 7 EV PBI 6 40.7
## 8 EV TUL 315 33.7
## 9 EV OKC 346 30.6
## 10 UA JAC 23 29.9
## # ... with 304 more rows
flights %>%
group_by(dest) %>%
summarise(count = n(),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
## # A tibble: 105 x 3
## dest count mean_delay
## <chr> <int> <dbl>
## 1 CAE 116 41.8
## 2 TUL 315 33.7
## 3 OKC 346 30.6
## 4 JAC 25 28.1
## 5 TYS 631 24.1
## 6 MSN 572 20.2
## 7 RIC 2454 20.1
## 8 CAK 864 19.7
## 9 DSM 569 19.0
## 10 GRR 765 18.2
## # ... with 95 more rows
Looks like ExpressJet Airlines Inc. has pretty bad delays at lots of airports – CAE, TYS, PBI, TUL, OKC. United and SkyWest are also not great.
The sort argument to count() sorts the output in descending order of the count. This might be helpful if you want to look at the head of a table without using arrange().
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
flights %>%
mutate(canceled = is.na(arr_time),
late = !canceled & arr_delay > 0) %>%
filter(canceled == FALSE) %>%
group_by(tailnum) %>%
summarise(count = n(),
n_delay = sum(late, na.rm = TRUE),
prop_delay = n_delay/count) %>%
mutate(prop_rank = min_rank(desc(prop_delay))) %>%
filter(count > 5) %>%
arrange(prop_rank)
## # A tibble: 3,575 x 5
## tailnum count n_delay prop_delay prop_rank
## <chr> <int> <int> <dbl> <int>
## 1 N337AT 13 12 0.923 105
## 2 N169AT 11 10 0.909 106
## 3 N168AT 18 16 0.889 107
## 4 N261AT 9 8 0.889 107
## 5 N284AT 9 8 0.889 107
## 6 N290AT 16 14 0.875 110
## 7 N551NW 7 6 0.857 111
## 8 N273AT 13 11 0.846 112
## 9 N299AT 6 5 0.833 113
## 10 N388SW 6 5 0.833 113
## # ... with 3,565 more rows
This table summarizes the planes with the highest percentage of delayed flights (excluding cancelled flights). I’ve only included planes that have flown more than five times, since many planes have flown only once and were delayed. This shows that plane N337AT was delayed on 12 of its 13 flights in 2013.
flights %>%
group_by(hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = hour, y = avg_delay)) + geom_bar(stat = "identity") + theme_bw()
In this question I’m using dep_delay instead of arr_delay because it seems like you would want to avoid delays at the gate when choosing flight time. It looks like early morning (5am) is the best time – average delay is less than 1 minute! This makes sense, because delays accumulate throughout the day as flights come in late and then leave late.
flights %>%
filter(!is.na(arr_delay), arr_delay > 0) %>%
group_by(dest) %>%
mutate(total_delay = sum(arr_delay),
prop_delay = arr_delay / total_delay) %>%
select(flight, dest, total_delay, prop_delay)
## # A tibble: 133,004 x 4
## # Groups: dest [103]
## flight dest total_delay prop_delay
## <int> <chr> <dbl> <dbl>
## 1 1545 IAH 99391 0.000111
## 2 1714 IAH 99391 0.000201
## 3 1141 MIA 140424 0.000235
## 4 1696 ORD 283046 0.0000424
## 5 507 FLL 202605 0.0000938
## 6 301 ORD 283046 0.0000283
## 7 194 LAX 203226 0.0000344
## 8 707 DFW 110009 0.000282
## 9 4650 ATL 300299 0.0000400
## 10 4401 DTW 138258 0.000116
## # ... with 132,994 more rows
flights %>%
arrange(origin, month, day, dep_time) %>% # arrange in order
group_by(origin) %>% # group by airport to avoid cross-airport comparisons
mutate(prev_delay = lag(dep_delay)) %>% # create lagged delay column
group_by(dep_delay) %>% # otherwise, they'll have the same value -- this preserves lag
summarise(m_prev_delay = mean(prev_delay, na.rm = TRUE)) %>% # find mean delay for previous flight
ggplot(aes(dep_delay, m_prev_delay)) + geom_point() + theme_bw() + xlab("Current Flight Delay") + ylab("Prev Flight Delay")
Here we can see that there is a pretty strong relationship between the previous flight’s delay and the current flight’s delay until we get to about a 5 hour delay, at which point anything goes.
flights %>%
group_by(origin, dest) %>%
mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
select(month, day, flight, origin, dest, air_time, med_air_time) %>%
mutate(diff = air_time - med_air_time) %>%
arrange(diff)
## # A tibble: 336,776 x 8
## # Groups: origin, dest [224]
## month day flight origin dest air_time med_air_time diff
## <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 7 2 4667 EWR MSP 93 149 -56
## 2 7 14 673 EWR SNA 274 329 -55
## 3 9 6 771 JFK LAX 275 329 -54
## 4 7 3 161 JFK SEA 275 328 -53
## 5 7 12 21 JFK LAX 277 329 -52
## 6 7 13 485 EWR SNA 279 329 -50
## 7 7 15 329 EWR SNA 279 329 -50
## 8 8 27 797 JFK LAX 279 329 -50
## 9 7 15 712 JFK LAX 280 329 -49
## 10 9 28 15 EWR HNL 562 611 -49
## # ... with 336,766 more rows
If we compare the air time for each flight to the median air time for that origin-destination pair, we see that the fastest flights are only about an hour quicker than a normal flight between those airports. This doesn’t necessarily indicate an error in data entry. You can note that most of these quick flights happen in July – maybe when the weather is pretty clear.
flights %>%
group_by(origin, dest) %>%
mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
select(month, day, flight, origin, dest, air_time, med_air_time) %>%
mutate(diff = air_time - med_air_time) %>%
arrange(desc(diff))
## # A tibble: 336,776 x 8
## # Groups: origin, dest [224]
## month day flight origin dest air_time med_air_time diff
## <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 7 28 841 JFK SFO 490 347 143
## 2 1 28 575 JFK EGE 382 255 127
## 3 11 22 426 JFK LAX 440 329 111
## 4 9 10 745 LGA DEN 331 227 104
## 5 11 22 587 EWR LAS 399 298 101
## 6 6 29 1491 JFK ACK 141 41 100
## 7 1 2 4204 EWR OKC 286 190 96
## 8 1 15 4204 EWR OKC 284 190 94
## 9 7 10 17 JFK LAX 422 329 93
## 10 12 6 434 JFK SFO 438 347 91
## # ... with 336,766 more rows
If we look at the opposite, we can see some flights that were in the air a lot longer than average. For example, the flight on 7/28 from JFK to SFO was in the air over 2 hours more than average.
airports_carriers <- flights %>%
group_by(dest) %>%
summarise(n_carriers = n_distinct(carrier)) %>%
filter(n_carriers > 2) %>%
arrange(n_carriers)
flights %>%
filter(dest %in% airports_carriers$dest) %>%
group_by(carrier) %>%
summarise(n_dest = n_distinct(dest)) %>%
arrange(desc(n_dest))
## # A tibble: 15 x 2
## carrier n_dest
## <chr> <int>
## 1 DL 37
## 2 EV 36
## 3 UA 36
## 4 9E 35
## 5 B6 30
## 6 AA 17
## 7 MQ 17
## 8 WN 9
## 9 OO 5
## 10 US 5
## 11 VX 3
## 12 YV 3
## 13 FL 2
## 14 AS 1
## 15 F9 1
It looks like Delta flies to the most multi-carrier airports, followed by ExpressJet and United.
flights %>%
filter(!is.na(air_time)) %>%
arrange(tailnum, month, day, dep_time) %>%
mutate(one_hour = ifelse(dep_delay > 60, 1,0)) %>%
group_by(tailnum) %>%
mutate(cumsum = cumsum(one_hour)) %>%
filter(cumsum == 0) %>%
summarise(count = n())
## # A tibble: 3,804 x 2
## tailnum count
## <chr> <int>
## 1 N0EGMQ 53
## 2 N10156 9
## 3 N102UW 25
## 4 N103US 46
## 5 N104UW 3
## 6 N105UW 22
## 7 N107US 20
## 8 N108UW 36
## 9 N109UW 48
## 10 N110UW 15
## # ... with 3,794 more rows
Two main questions for EDA:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Variation is how a variable changes between measurements.
Categorical variables can be visualized using a bar chart.
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) + theme_bw()
Continuous variables can be visualized using a histogram.
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) + theme_bw()
To view multiple distributions of continuous variables at once, use geom_freqpoly().
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1) + theme_bw()
Things to look out for:
Typical values: Why are some values more common than others? Why are some values non-existent in the dataset? Why do some values seem to cluster together?
Unusual values: What outliers exist? Does your analysis change if you exclude them?
library(gridExtra)
x <- ggplot(diamonds, aes(x)) + geom_histogram() + theme_bw()
y <- ggplot(diamonds, aes(y)) + geom_histogram() + theme_bw()
z <- ggplot(diamonds, aes(z)) + geom_histogram() + theme_bw()
grid.arrange(x,y,z,ncol=1)
The documentation says that x is length, y is width, and z is depth. This suggests that most of the diamonds are not cut in round or square shapes where the length and width would be very similar. More information about the cut shape of the diamonds would help us determine this without the documentation.
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw()
Nothing super weird about this plot right now.
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()
Making the bin width smaller shows us a weird gap somewhere between 0 and 2500 dollars. Let’s look at that.
lessthanthreek <- filter(diamonds, price < 3000)
ggplot(lessthanthreek, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()
There’s a very specific range of prices around $1500 where there are no cases. That’s weird!
diamonds %>%
filter(carat == 0.99 | carat == 1) %>%
group_by(carat) %>%
summarise(count = n())
## # A tibble: 2 x 2
## carat count
## <dbl> <int>
## 1 0.99 23
## 2 1 1558
There are a lot more diamonds that are 1 carat than those that are 0.99. This is probably because people want to buy a 1-carat diamond (as opposed to a 0.99) – jewelers might be rounding up from 0.99 to 1.
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + coord_cartesian(xlim = c(200, 1000))
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + xlim(200, 1000)
It seems like coord_cartesian is simply zooming in on the larger plot – the data extends past the last tick mark, and there is only one bin (because the 30 total bins come from the entire dataset). However, xlim (and presumably ylim) subsets the data to those that fit the limits defined. So, our data between 200 and 1000 is binned into 30 pieces using xlim.
Sometimes when we have outlier/weird/obviously-an-error data points, we will set those values to NA rather than drop the entire case. However, we have to think about how R is going to deal with missing data after that point.
mis_diamonds <- diamonds %>%
mutate(price_miss = ifelse(price > 750 & price < 1900, NA, price))
ggplot(mis_diamonds, aes(price_miss)) + geom_histogram(binwidth = 10) + theme_bw()
In a histogram, the missing values are just removed.
mis_diamonds <- diamonds %>%
mutate(color_miss = ifelse(color == "H", NA, as.character(color)))
ggplot(mis_diamonds, aes(as.factor(color_miss))) + geom_bar() + xlab("Color") + theme_bw()
In a bar chart, they’re plotted as NA, so you can see how many missing values you have.
What does na.rm = TRUE do in mean() and sum()?
For both of these functions, na.rm = TRUE removes NA values before doing the calculation.
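A minimal illustration of the difference:

```r
x <- c(1, NA, 3)
mean(x)                # NA: a single NA poisons the result
mean(x, na.rm = TRUE)  # 2
sum(x, na.rm = TRUE)   # 4
```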
Covariation is the tendency for the values of two or more variables to vary together in a related way. We’ll look at this by checking out some plots.
Boxplots are good for looking at the relationship between one categorical and one continuous variable.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) + theme_bw() + xlab("car type")
For two categorical variables, we can make one of those cool grids.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
When we are dealing with two continuous variables, we default to geom_point(). But it’s hard to read scatterplots when we have huge datasets. Other ways we can plot are with geom_bin2d() or geom_hex(). We can also bin a continuous variable and look at it in some boxplots.
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
library(hexbin)
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(cancelled, sched_dep_time)) + geom_boxplot() + theme_bw()
Generally, cancelled flights were scheduled to leave later than non-cancelled flights.
I would expect that the variable most related to price is carat. Let’s look at how it relates to cut.
ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw()
It looks like the lowest quality (Fair) diamonds are also the ones that are the biggest. That is probably why the lower quality diamonds are the most expensive.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
library(ggstance)
ggplot(diamonds, aes(carat, cut)) + geom_boxploth() + theme_bw()
ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw() + coord_flip()
It looks like ggstance allows you to list the aesthetic mappings in the opposite way.
Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
library(lvplot)
ggplot(diamonds, aes(cut, price)) + geom_lv(aes(fill=..LV..)) + theme_bw() + scale_fill_lv()
The LV plot is fine (I guess), but I’d rather just use a violin plot if I want to know something about distributions.
ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + facet_wrap(~cut)
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500) + theme_bw()
Violin plots are great if you want to compare distributions side-by-side, as you can get a sense for central tendency as well as spread. I don’t like faceted histograms that much at all (esp. because that’s what a violin plot is, basically). The geom_freqpoly() is great for seeing relative frequencies, which are harder to see in violin plots.
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
Not going to explain these, but it’s a cool package to point out.
# cut within color
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))
# color within cut
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))
flights %>%
group_by(dest, month) %>%
summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = dest, y = month)) +
geom_tile(mapping = aes(fill = m_delay))
First, there are a lot of destinations with no flights for a given month. We can change that by filtering. It’s also hard to read the continuous scale. I’ll change that by binning it.
flights %>%
group_by(dest) %>%
mutate(n_flights = n()) %>%
filter(n_flights > 15) %>%
group_by(dest, month) %>%
summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
mutate(delay_bin = ifelse(m_delay < 0, "Early",
ifelse(m_delay >= 0 & m_delay < 30, "0-30 min late",
ifelse(m_delay >= 30 & m_delay < 60, "30-60 min late",
ifelse(m_delay >= 60, "60+ min late", NA))))) %>%
mutate(delay_bin = factor(delay_bin, levels = c("Early", "0-30 min late", "30-60 min late", "60+ min late"))) %>%
ggplot(mapping = aes(x = as.factor(month), y = dest)) +
geom_tile(mapping = aes(fill = delay_bin)) + scale_fill_brewer(direction = 1) + xlab("Month")
It would be nice to have a way to filter the relevant destinations as well.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n))
There are more colors, and it’s easier to read them along the x-axis.
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
ggplot(data = diamonds,
mapping = aes(color = cut_number(carat, 5), x = price)) +
geom_freqpoly() + theme_bw()
cut_number() makes a fixed number of bins, each holding roughly the same number of cases, so you need to make sure they give you the right resolution on your data.
ggplot(data = diamonds,
mapping = aes(color = cut_width(carat, 1), x = price)) +
geom_freqpoly() + theme_bw()
cut_width() makes bins of a fixed width, so again, think about the resolution you need for your data.
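The difference is easy to see in a quick tabulation. A sketch, assuming ggplot2 is loaded (both helpers live in ggplot2):

```r
library(ggplot2)

# cut_width(): bins of fixed width; the number of bins depends on the data range
table(cut_width(diamonds$carat, 1))

# cut_number(): a fixed number of bins, each holding roughly equal numbers of rows
table(cut_number(diamonds$carat, 5))
```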
ggplot(diamonds, aes(x = price, y = carat)) +
geom_boxplot(mapping = aes(group = cut_width(price, 1000))) + theme_bw()
Very small diamonds have a much narrower price distribution than very large diamonds. I would assume that for small diamonds, there is only so much variation that quality can give – a really great tiny diamond probably costs something similar to a not-so-great tiny diamond.
ggplot(diamonds, aes(x = cut_number(price,5), y = carat, fill = cut)) +
geom_boxplot() + theme_bw()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Since most of the data are on the same line, it’s hard to read the binned plot here. We’re looking for individual outlier points, which are hard to see on this type of plot.
General ideas:
Keep everything you do in your analysis in the script. If you run things right in the console, assume you will forget them later.
Instead of using setwd() all the time, use an R project.
I think R projects are just basically directories that are structured in a standard way, where you use relative paths rather than absolute paths.
as_tibble() works like as.data.frame(), and tibble() works like data.frame(), except that they don’t change the type of the input (e.g., string to factor).
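A small illustration of the type-preservation difference, assuming the tibble package is loaded. (Under R >= 4.0, data.frame() no longer converts strings by default, so the old behaviour is recreated here with stringsAsFactors = TRUE.)

```r
library(tibble)

# data.frame() converted strings to factors under the pre-R 4.0 default;
# tibble() never does
df  <- data.frame(x = "a", stringsAsFactors = TRUE)
tbl <- tibble(x = "a")

class(df$x)   # "factor"
class(tbl$x)  # "character"
```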
You can use tribble() to make creating a data frame easier to read.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
## # A tibble: 2 x 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
By default, a tibble will only print the first 10 rows of a table and only those columns that fit on the screen. You can modify this if you want. width = Inf will print all the columns.
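The defaults can also be changed session-wide. A sketch using option names from the tibble package (check your installed version, as these have been renamed over time):

```r
library(tibble)

# Print up to 15 rows before truncating; when truncating, show 10
options(tibble.print_max = 15, tibble.print_min = 10)

as_tibble(mtcars)
```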
nycflights13::flights %>%
print(n = 10, width = Inf)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 819 11 UA 1545 N14228 EWR IAH 227
## 2 830 20 UA 1714 N24211 LGA IAH 227
## 3 850 33 AA 1141 N619AA JFK MIA 160
## 4 1022 -18 B6 725 N804JB JFK BQN 183
## 5 837 -25 DL 461 N668DN LGA ATL 116
## 6 728 12 UA 1696 N39463 EWR ORD 150
## 7 854 19 B6 507 N516JB EWR FLL 158
## 8 723 -14 EV 5708 N829AS LGA IAD 53
## 9 846 -8 B6 79 N593JB JFK MCO 140
## 10 745 8 AA 301 N3ALAA LGA ORD 138
## distance hour minute time_hour
## <dbl> <dbl> <dbl> <dttm>
## 1 1400 5 15 2013-01-01 05:00:00
## 2 1416 5 29 2013-01-01 05:00:00
## 3 1089 5 40 2013-01-01 05:00:00
## 4 1576 5 45 2013-01-01 05:00:00
## 5 762 6 0 2013-01-01 06:00:00
## 6 719 5 58 2013-01-01 05:00:00
## 7 1065 6 0 2013-01-01 06:00:00
## 8 229 6 0 2013-01-01 06:00:00
## 9 944 6 0 2013-01-01 06:00:00
## 10 733 6 0 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame.)
It will print differently – in particular, tibbles will say A tibble at the top.
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df_tibble <- tibble(abc = 1, xyz = "a")
df$x
## [1] a
## Levels: a
df_tibble$x
## NULL
In this one, the data.frame returned the most similar column (xyz), while the tibble returned NULL (with a warning). Tibbles don’t do partial matching. A partial match could lead to errors – e.g., if you think you are calling column x but actually it was never created, so instead you are calling xyz.
class(df[, "xyz"])
## [1] "factor"
class(df_tibble[, "xyz"])
## [1] "tbl_df" "tbl" "data.frame"
Here, the data.frame returns a vector of a factor with one level (a), while the tibble returns a character tibble.
df[, c("abc", "xyz")]
## abc xyz
## 1 1 a
df_tibble[, c("abc", "xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <chr>
## 1 1 a
In this case they return the same thing.
If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
var <- "mpg"
mtcars %>% .[[var]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
annoying$`1`
## [1] 1 2 3 4 5 6 7 8 9 10
annoying %>%
ggplot(aes(`1`, `2`)) + geom_point()
annoying <- annoying %>%
mutate(`3` = `2`/`1`)
names(annoying) <- c("one", "two", "three")
annoying
## # A tibble: 10 x 3
## one two three
## <int> <dbl> <dbl>
## 1 1 1.26 1.26
## 2 2 5.15 2.57
## 3 3 7.18 2.39
## 4 4 7.00 1.75
## 5 5 9.76 1.95
## 6 6 13.4 2.23
## 7 7 13.9 1.99
## 8 8 14.6 1.83
## 9 9 17.6 1.96
## 10 10 19.7 1.97
What does tibble::enframe() do? When might you use it?
enframe() takes a named vector and turns it into a two-column data frame. It is useful for dealing with a lot of the output from stats methods.
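A quick sketch (the vector of "coefficients" here is just made up for illustration):

```r
library(tibble)

# A named vector, like the output of many base stats functions
coefs <- c(intercept = 1.2, slope = 0.8)

enframe(coefs)
# A tibble with columns `name` and `value`, one row per element
```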
You can use n_extra in print().
read_csv()
skip = n lets you skip a few lines at the top of the file (good for E-Prime exports). comment = "#" will skip any lines that start with #.
readr vs. base R
read_csv() from readr is better than base R’s read.csv: it is much faster, produces tibbles, doesn’t convert character columns to factors, and behaves consistently across operating systems.
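A sketch of skip and comment together, using an inline string in place of a file just for illustration:

```r
library(readr)

# Skip the one junk line at the top, and ignore lines starting with "#"
read_csv("exported by E-Prime\n# a comment line\nx,y\n1,2",
         skip = 1, comment = "#")
```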
The parse_*() functions take a character vector to parse, plus an na argument giving the strings to treat as NA.
To parse numbers, you can give arguments to locale – e.g., changing from 1.00 to 1,00.
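For example, European-style numbers where "." groups thousands and "," marks decimals can be handled through locale():

```r
library(readr)

# "," as the decimal mark
parse_double("1,23", locale = locale(decimal_mark = ","))

# "." groups thousands, "," marks decimals
parse_number("1.234,56",
             locale = locale(grouping_mark = ".", decimal_mark = ","))
```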
Sometimes it’s useful to parse factors as character vectors until you’ve finished cleaning them up.
Exercises
1. What function would you use to read a file where fields were separated with |?
read_delim(path_to_file, delim = "|")
## Error in is.connection(x): object 'path_to_file' not found
Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
col_names, col_types, locale, na, quoted_na, quote, trim_ws, skip, n_max, guess_max, progress.
What are the most important arguments to read_fwf()?
Probably file and col_positions. col_positions says how wide each of the fixed-width columns are.
Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
read_csv("x,y\n1,'a,b'", quote = "'")
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
read_csv("a,b\n1,2,3\n4,5,6")
## # A tibble: 2 x 2
## a b
## <int> <int>
## 1 1 2
## 2 4 5
read_csv("a,b,c\n1,2\n1,2,3,4")
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 NA
## 2 1 2 3
These have the wrong number of fields in some rows: extra values get dropped, and missing values are filled with NA.
read_csv("a,b\n\"1")
## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
This one has an unclosed quote, and no value for b.
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
This one parses without complaint, but the letters in the last row force both columns to character.
read_csv("a;b\n1;3")
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
This one needs to have the separator changed to ;.
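One way to do that is to name the delimiter explicitly:

```r
library(readr)

# Same text, but parsed with ";" as the field separator
read_delim("a;b\n1;3", delim = ";")
```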
What are the most important arguments to locale()?
The most important argument is probably date_names, where you can just set it to something like en or fr.
What happens if you try and set decimal_mark and grouping_mark to the same character?
parse_number("123,456,789", locale = locale(grouping_mark = ",", decimal_mark = ","))
## Error: `decimal_mark` and `grouping_mark` must be different
You get an error.
What happens to the default value of grouping_mark when you set decimal_mark to ,?
parse_number("123,456,789", locale = locale(decimal_mark = ","))
## [1] 123.456
The default grouping_mark changes to the period (.).
What happens to the default value of decimal_mark when you set the grouping_mark to .?
parse_number("123,456,789", locale = locale(grouping_mark = "."))
## [1] 123.456
The default decimal_mark changes to the comma (,).
I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
They let you set what the date or time format should be manually. E.g.,
str(parse_guess("01/02/2013", locale = locale(date_format = "%d/%m/%Y")))
## Date[1:1], format: "2013-02-01"
I live inside the US!
What’s the difference between read_csv() and read_csv2()?
read_csv() uses , for the separator and . for decimals. read_csv2() uses ; for the separator and , for the decimal mark.
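A quick demonstration with an inline string:

```r
library(readr)

# read_csv2() is made for European-style files: ";" separates fields
# and "," marks decimals, so "1,5" is read as the number 1.5
read_csv2("a;b\n1,5;3")
```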
The ISO 8859 encodings are commonly used in Europe (e.g., for Greek, Polish, and Slovenian), as are windows-1251 and windows-1254. Chinese uses Big5 and GB 18030, Korean uses EUC-KR, and Japanese uses Shift_JIS.
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12
If your columns are not coming in as the right type, you can tell readr how to parse them manually.
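A sketch of manual column specification with col_types (the toy data here is made up):

```r
library(readr)

# Force x to double and y to character instead of relying on type guessing
read_csv("x,y\n1,a",
         col_types = cols(x = col_double(), y = col_character()))
```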
If you want to save interim results, you can use write_rds and read_rds to preserve column types.
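A round-trip sketch (the temp-file path is just for illustration):

```r
library(readr)

path <- tempfile(fileext = ".rds")  # illustrative path
write_rds(mtcars, path)

# Unlike CSV, the rds round trip preserves column types exactly
identical(read_rds(path), mtcars)
```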